Abstract

In this work we investigate from a computational perspective the efficiency of the Willshaw synaptic update rule in the context of familiarity discrimination, a binary-answer, memory-related task that has been linked through psychophysical experiments with modified neural activity patterns in the prefrontal and perirhinal cortex regions. Our motivation for recovering this well-known learning prescription is two-fold: first, the switch-like nature of the induced synaptic bonds, as there is evidence that biological synaptic transitions might occur in a discrete stepwise fashion; second, the possibility that in the mammalian brain unused, silent synapses might be pruned in the long term. Besides the usual pattern and network capacities, we calculate the synaptic capacity of the model, a recently proposed measure where only the functional subset of synapses is taken into account. We find that in terms of network capacity, Willshaw learning is strongly affected by the pattern coding rates, which have to be kept fixed and very low at any time to achieve a non-zero capacity in the large network limit. The information carried per functional synapse, however, diverges and is comparable to that of the pattern association case, even for more realistic moderately low activity levels that are a function of network size.

Keywords: familiarity memory, Willshaw rule, synaptic capacity, sparse coding



1 Introduction 



Observations of psychophysical and neurophysiological order have drawn attention to the so-called familiarity discrimination or detection task, where tested subjects need only to recognise once-seen objects without being asked to recollect detailed feature or context descriptions (Xiang and Brown, 1998, 2004; Yakovlev et al, 2008). From the computational perspective, the essential aim is to devise a neural network model that is biologically plausible up to a certain degree of realism and that is able to explain in part the seemingly limitless memorising ability of the brain to solve this task (Standing, 1973).



As in previous familiarity memory neural network modelling efforts (Bogacz et al, 2001; Greve et al, 2009; Cortes et al, 2010), the formulation of the task that we consider involves a set of $M$ patterns

$$S = \{x^1, \ldots, x^M\} \tag{1}$$

that have been presented to the network for learning and that ought to be recognised as familiar in future presentations, while any other pattern not belonging to $S$ should be classified as novel. Each of the patterns is a binary vector $x^\mu \in \{0,1\}^m$, $x_i^\mu$ representing the (silent-firing) activity of the $i$-th neuron at a given time frame $\mu$; the task itself is binary as well, in the sense that we seek to decide if a certain presented pattern $x$ is either familiar or novel. The structure of the network is given at any time by the $m \times m$ connectivity matrix $W$, where the entry $W_{ij}$ denotes the strength of the bond from presynaptic neuron $i$ to postsynaptic neuron $j$.

To learn the desired mapping, each neuron should be able to determine at the synapse level ('locally') the network connectivity structure so that in subsequent pattern presentations one can extract from the collective activity of the $m$ neurons the desired novel-familiar response. The model is then characterised by a local synaptic learning rule and by a discrimination function. On the one hand, given a pattern $x^\mu$ that should be memorised, the former determines each synaptic weight solely by inspection of the variables $W_{ij}$, $x_i$ and $x_j$; the latter, given a query pattern $x$ and the structure of the network $W$, elicits the binary familiarity response.

We focus on modelling long-term memory, in opposition to palimpsestic working memory (Parisi, 1986; Amit and Fusi, 1994; Leibold and Kempter, 2008; Barrett and van Rossum, 2008; Yakovlev et al, 2008), where 'overwriting' takes place and the signal of past memories decays over time. For long-term familiarity detection, a model that is capable of storing an extensive number of patterns per synapse has been proposed (Bogacz et al, 2001) and recently shown to correspond to the optimal linear, local familiarity learning prescription (Greve et al, 2009). However, the network is only capable of storing a rather small amount of information per synapse, and the proposed synaptic update scheme requires maintenance of real-valued synapses over a long period of time.
In our work, we consider as an alternative the binary non-linear Willshaw (or Steinbuch) prescription (Steinbuch, 1961; Willshaw et al, 1969) in the context of familiarity discrimination. This learning rule has certain properties that have made it desirable when applied to the associative memory problem, where it has been extensively analysed (see, e.g., Willshaw et al, 1969; Palm, 1980; Golomb et al, 1990; Nadal and Toulouse, 1990; Palm and Sommer, 1992; Buckingham and Willshaw, 1992; Brunel, 1994; Graham and Willshaw, 1995; Sommer and Palm, 1999; Knoblauch et al, 2010); namely, the high storage capacity attained when the model is correctly parametrised, its simplicity, and the fact that the generated synaptic matrix $W$ is binary. This last feature is particularly interesting since in cortical regions supporting memory-related tasks the synaptic transitions may operate in a discrete (few steps) or even in a binary switch-like fashion. There is accumulating experimental evidence supporting discrete transitions at least in the initial phase of long-term potentiation, although it remains unclear whether or not long-term synaptic efficacies may still have a gradual distribution (Petersen et al, 1998; Montgomery and Madison, 2004; O'Connor et al, 2005).



Furthermore, an inhibitory variant of the Willshaw rule has recently been proposed by Knoblauch et al (2010), motivated by the possibility of structural plasticity by synaptic pruning and growth as a support for long-term memory encoding in the adult mammalian brain (Chklovskii et al, 2004), alongside well-established synaptic weight change mechanisms such as long-term potentiation and depression. In the associative case, the inhibitory Willshaw rule has led to the discovery of new efficient working regimes where few active synapses can carry a high Shannon information content.

In this article we show in a first step that for medium-sized networks the classical pattern and Shannon capacities of the Willshaw model are comparable to those of the real-valued network of Bogacz et al (2001), provided that the patterns exhibit low activity levels at any time (the so-called sparse coding regime), a fact that has already been pointed out in the dynamical synapse analysis of Barrett and van Rossum (2008). We also show that in the limit of large networks $m \to \infty$, the network capacity vanishes unless the coding rates are extremely low.

In line with the recent observations of Knoblauch et al (2010), we then investigate alternative parametrisations of the Willshaw model. We find that the high pattern loadings associated with the familiarity discrimination task lead to dense potentiation of the memory matrix, a regime where the inhibitory interpretation of the original Willshaw model is especially efficient. It is shown that if the low cost of silent synapses (which might even be pruned in the long term) is neglected, the inhibitory network is capable of achieving large synaptic capacities that increase with the number of neurons, under realistic moderately low coding rates. Finally, we take into consideration the effects of varying the coding level per pattern; at least when the level follows a binomial distribution, introducing a feedforward inhibitory correction in the discriminator compensates for the additional signal variability and the system remains qualitatively intact, albeit operating with lower overall efficiency in the finite-size case.



2 Results 

The simplest possible local, non-linear, binary synaptic rule is the well-known Willshaw prescription (Steinbuch, 1961; Willshaw et al, 1969; Palm, 1980). Here, the weight update equation is an extreme case of Hebbian learning, where a single coincidental firing activity at any given time $\mu$ (i.e., $x_i^\mu = 1$ and $x_j^\mu = 1$) is sufficient to give rise to long-term potentiation at the synaptic contact $i \to j$. As there is just one potentiation level, each synapse $W_{ij}$ is a binary variable, either at the 0-state (silent synapse) or at the 1-state (present synapse). After $M$ pattern presentations, $W_{ij}$ is given by

$$W_{ij} = \min\left(1, \sum_{\mu=1}^{M} x_i^\mu x_j^\mu\right) \in \{0,1\}. \tag{2}$$
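The clipped rule (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `random_pattern` and `willshaw_learn`, the toy sizes, and the fixed-activity pattern generator are hypothetical choices, not part of the original analysis.

```python
import random

def random_pattern(m, k, rng):
    """Binary pattern of length m with exactly k active units (assumed statistics)."""
    active = set(rng.sample(range(m), k))
    return [1 if i in active else 0 for i in range(m)]

def willshaw_learn(patterns, m):
    """Clipped Hebbian rule: W[i][j] = min(1, sum over patterns of x_i * x_j)."""
    W = [[0] * m for _ in range(m)]
    for x in patterns:
        active = [i for i, xi in enumerate(x) if xi]
        for i in active:
            for j in active:
                W[i][j] = 1  # a single coincidence is enough to potentiate
    return W

rng = random.Random(0)          # fixed seed so the sketch is reproducible
m, k, M = 50, 5, 20             # toy sizes, chosen only for illustration
patterns = [random_pattern(m, k, rng) for _ in range(M)]
W = willshaw_learn(patterns, m)
```

Because the same pattern drives both the presynaptic and postsynaptic side, the resulting matrix is symmetric and every stored pattern finds all of its $k^2$ active pairs potentiated.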



Originally proposed in the context of an associative network with one-step (non-iterative) synchronous retrieval, the 0-1 Hebb rule (2) has been employed as well to embed patterns in attractor networks with symmetric couplings $W_{ij} = W_{ji}$. In this case, if an appropriate retrieval strategy is used so as to form large basins of attraction surrounding the desired fixed points, iteration generally leads to a more robust recall process, in terms of allowed cue distortion (given by a metric such as the Hamming distance $d_H(x, x^\mu) = \sum_i |x_i - x_i^\mu|$) as well as in terms of resistance to stochastic synaptic failure, where the $W_{ij}$ may randomly switch states with a certain probability (Golomb et al, 1990; Schwenker et al, 1996; Sommer and Dayan, 1998).

For familiarity discrimination, there is no need per se to extract the whole pattern $x^\mu$ from the network; rather, what one seeks is a prescription to determine a binary (novel-familiar) answer starting from a cue $x$, given the information stored in the synaptic connectivity matrix $W$.



The discriminator proposed by Bogacz et al (2001) and studied in formal memory models of familiarity (Bogacz and Brown, 2003; Greve et al, 2010) is based on the quadratic form

$$H(x) = -a \sum_{i \neq j} W_{ij} (x_i - f)(x_j - f) \in \mathbb{R}, \tag{3}$$

usually referred to as the energy function$^1$ of the network at a given state $x$, presented in its mean corrected form (Amit et al, 1987; Bogacz and Brown, 2002; Greve et al, 2009), where $f = m^{-1} E(\sum_i x_i)$ is the coding rate, i.e., the expected fraction of firing units per pattern. As has already been pointed out in previous works, equation (3) has a network implementation and is closely related to other measures of familiarity (see, e.g., the appendix of Greve et al, 2010).



In the proposed discrimination scheme, the desired binary decision is computed by 'clamping' into the network state a certain input pattern $x$ and then, without (or before) the retrieval dynamics taking place, by thresholding the resulting energy, i.e.

$$D(x) = [H(x) \le \theta] \in \{0,1\}, \tag{4}$$

where $[\cdot]$ is the binary random variable which is 1 if the argument holds and 0 otherwise. An appropriate choice of $a$ and $\theta$ should ensure that, given a weight matrix $W$ encoded according to a certain synaptic learning rule, as many patterns as possible belonging to $S$ are assigned one of the two decision outcomes (say, one), and all the others to the opposite class (say, zero).

It has been recently shown by Greve et al (2009) that for such a discriminator, the asymptotically optimal ($m \to \infty$ and a size-dependent load $M$) local linear synaptic weight setting when we allow the $W_{ij}$ to assume real values is given by the covariance learning rule (Amit et al, 1987; Tsodyks and Feigel'man, 1988; Dayan and Willshaw, 1991; Palm and Sommer, 1996):

$$W_{ij} \propto \sum_{\mu=1}^{M} (x_i^\mu - f)(x_j^\mu - f) \in \mathbb{R}. \tag{5}$$



In this article we address the question of how well the clipped Hebbian rule (2) fares with a discriminator of the form (4). Specifically, for simplicity we redefine $H$ letting $a = 1$, performing the double summation over all $i, j$, and dropping the mean correction,

$$H(x) = -\sum_{i=1}^{m}\sum_{j=1}^{m} W_{ij} x_i x_j \in \mathbb{Z}, \tag{6}$$

recalling that each weight $W_{ij}$ is now a 0-1 binary variable.

$^1$ As for bipolar patterns and symmetrical networks ($W_{ij} = W_{ji}$) with no self-couplings ($W_{ii} = 0$), there is a strong analogy with the Hamiltonian of the zero-temperature Ising model (Hopfield, 1982; Amit et al, 1985).



Following the analysis of the associative Willshaw network carried out by Knoblauch et al (2010), we proceed by calculating three essential quantities: the maximal number of patterns $M_\epsilon$ that the system can discriminate allowing a certain (known) error level, the network capacity $C$ (in bits per synaptic contact), and the synaptic capacity $C^S$ (in bits per active synapse). We will then see that the Willshaw model becomes especially interesting regarding the latter quantity, as a modification to the clipped rule leads to the activation of a subset of few synapses within the full contact space of order $m^2$.



2.1 Maximal pattern load calculation for low activity levels 

The calculation of the maximal pattern load $M_\epsilon$ when the average activity is low ($f \ll 1$) can be performed analytically using a series of approximations which have been shown to be near-exact even for finite networks where $m$ is not large (Palm, 1980; Knoblauch, 2008; Knoblauch et al, 2010).



We consider the two usual simplified binary pattern generation scenarios: first, we deal with the case where every pattern presented to the network for learning has a fixed, known a priori activity level $|x^\mu| = \sum_{i=1}^{m} x_i^\mu = k$, as in the analysis of Palm (1980); later (in section 2.4), we consider patterns where $|x^\mu|$ is a binomially-distributed random variable with characteristic probability equal to the coding rate $f = k/m$, $k$ being again a fixed known a priori parameter. In this case, although the activity of each pattern is allowed to vary, by construction the average level is $mf = k$ and all neurons are activated equally and independently (Buckingham and Willshaw, 1992).



With these statistics at hand we can determine the average weight matrix load,

$$p_1 = E(W_{ij}) = P(W_{ij} = 1) = 1 - P(W_{ij} = 0) \tag{7}$$
$$= 1 - (1 - f^2)^M = 1 - \exp\left(M \ln(1 - f^2)\right) \tag{8}$$
$$\approx 1 - \exp(-f^2 M). \tag{9}$$

The approximation assumes that the coding rates are low, i.e., $f^2 \ll 1$.
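As a quick numerical check of how tight the low-rate approximation is, the following sketch compares the exact load $1 - (1 - f^2)^M$ with $1 - \exp(-f^2 M)$. The function names and the parameter values ($m = 1000$, $k = 10$, $M = 5000$) are assumptions picked for illustration only.

```python
import math

def load_exact(f, M):
    """Matrix load 1 - (1 - f^2)^M, before the low-rate approximation."""
    return 1.0 - (1.0 - f * f) ** M

def load_approx(f, M):
    """Low-rate approximation 1 - exp(-f^2 M)."""
    return 1.0 - math.exp(-f * f * M)

m, k, M = 1000, 10, 5000        # illustrative values, not taken from the text
f = k / m                       # coding rate
gap = abs(load_exact(f, M) - load_approx(f, M))
```

For a coding rate of one percent the two expressions differ only in the fifth decimal place, which is consistent with the claim that the approximation is near-exact for sparse patterns.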

Clearly, as observed when employing the Willshaw rule to solve the associative task, $p_1$ is a critical quantity: to recover information about the patterns in $S$ one must control both the cardinality $M$ and the sparseness parameter $f$ so as to avoid $p_1 = 1$. It is useful to calculate $M$ given $p_1$,

$$\ln(1 - p_1) \approx -M f^2 \;\Leftrightarrow\; M \approx -f^{-2} \ln(1 - p_1). \tag{10}$$



Regarding familiarity detection in general, two types of error may occur: omission errors (denoted as '10' errors) whenever $x \in S$ but the system fails to classify the pattern as familiar; conversely, commission errors (denoted as '01' errors) when $x \notin S$ but the discriminator indicates familiarity. For patterns with fixed (for all $\mu$) activity $k$ and $W$ set according to the Willshaw rule (2), there is a simple threshold setting which avoids omission errors altogether, i.e., a $\theta$ such that for all $\mu$ we have with probability one $D(x^\mu) = 1$. For a familiar cue $x \in S$ corresponding to a certain learned $x^\mu$ we have

$$H(x) = -\sum_{i=1}^{m}\sum_{j=1}^{m} W_{ij} x_i x_j \tag{11}$$
$$= -\sum_{i=1}^{m}\sum_{j=1}^{m} x_i^\mu x_j^\mu = -k^2 = \theta_W, \tag{12}$$

where the equality from (11) to (12) is valid since $W_{ij} = 1 \Leftrightarrow \exists \mu,\ x_i^\mu = 1 \wedge x_j^\mu = 1$. In a sense, $\theta_W$ is the familiarity discrimination threshold which corresponds to the classical Willshaw threshold $|x| = k$ for the noise-free associative task (Willshaw et al, 1969; Palm, 1980).
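To make the threshold argument concrete, here is a minimal sketch of the simplified energy $H(x) = -\sum_{i,j} W_{ij} x_i x_j$ and of a discriminator that answers 'familiar' when $H(x) \le \theta$, evaluated at $\theta = -k^2$. The two stored toy patterns, the probe pattern and the function names are hypothetical; the non-strict inequality is an assumption chosen so that a stored pattern, whose energy equals $-k^2$ exactly, is always accepted.

```python
def clipped_weights(patterns, m):
    """Willshaw matrix: W[i][j] = 1 if units i and j were ever co-active."""
    W = [[0] * m for _ in range(m)]
    for x in patterns:
        for i in range(m):
            for j in range(m):
                if x[i] and x[j]:
                    W[i][j] = 1
    return W

def energy(W, x):
    """Simplified energy: minus the number of potentiated synapses between active units."""
    active = [i for i, xi in enumerate(x) if xi]
    return -sum(W[i][j] for i in active for j in active)

def discriminate(W, x, theta):
    """Return 1 (familiar) if the energy is at or below the threshold, else 0 (novel)."""
    return 1 if energy(W, x) <= theta else 0

stored = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0]]   # toy patterns with k = 2
k = 2
W = clipped_weights(stored, 6)
theta_w = -k * k                                     # threshold with no omission errors
novel = [1, 0, 1, 0, 0, 0]                           # same activity, never presented
```

With this setting both stored patterns reach the energy $-k^2$, whereas the probe only finds its two diagonal synapses potentiated and is therefore rejected.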



When $\theta_W$ is the discrimination threshold and $x$ is a novel pattern, generated according to the same statistics as the $x^\mu$ but not presented for learning, if the non-zero $W_{ij}$ coincide with active $i, j$ units enough such that $H(x)$ reaches $-k^2$, a commission error will occur. We can calculate this error probability resorting to $p_1$, assuming that the 'ones' in $W$ were randomly and independently set$^2$:

$$p_{01} = P(D(x) = 1 \mid x \notin S) \approx P(D(x) = 1) \tag{13}$$
$$= p_1^{(-\theta_W - k)/2} \tag{14}$$
$$\approx p_1^{k^2/2}, \tag{15}$$

where the 1/2 correction comes from the symmetry in $W$. To reach our final expression (15), we approximate $(k^2 - k)/2$ by the leading term $k^2/2$, although equation (14) would yield a better approximation to the true value of $p_{01}$ as the learning rule (2) sets the diagonal entries of $W$ to one with high probability.

While parametrising a memory device, to ensure the system performs the desired task correctly it is common to require that the probability of error remains below a certain bound. In the associative memory literature there are many criteria to enforce a quality level in the process; usually, the task parameters are found so that the error probability grows according to some controlled function of network size and the expected pattern activity level (Palm, 1980; Knoblauch et al, 2010). In the familiarity detection task, however, as there is no obvious reason to couple the probabilities to the parameters $k$ and $m$, it seems reasonable to maintain $p_{01}$ and $p_{10}$ below a fixed level (Bogacz and Brown, 2002).



$^2$ A well-known approximation employed e.g. in the analyses of Willshaw et al (1969), Palm (1980), Knoblauch (2008) and Knoblauch et al (2010), which is valid for sparse patterns with activity levels that are sublinear in $m$.

To keep the error probability $p_{01}$ lower than a desired level $p_{01\epsilon}$, we establish the 'breakdown' value $M_\epsilon$ for the pattern load, as a function of the coding rate $f$. Using the binomial approximation given by equation (15) we have

$$p_{01} \le p_{01\epsilon} \;\Leftrightarrow\; \left(1 - \exp(-f^2 M)\right)^{k^2/2} \le p_{01\epsilon}, \tag{16}$$

yielding, with respect to $M$,

$$M_\epsilon \approx -\frac{m^2}{k^2} \ln\left(1 - p_{01\epsilon}^{2/k^2}\right), \tag{17}$$

which is the pattern capacity we sought. Note that in the large network limit $m \to \infty$, for any coding rate such that $k \to \infty$, $M_\epsilon$ is independent of the fixed error bound $p_{01\epsilon}$, as we have

$$M_\epsilon \approx \frac{2 m^2 \ln k}{k^2}. \tag{18}$$
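The breakdown load can be checked numerically by inverting the error bound and feeding the result back into the commission error estimate $p_{01} \approx (1 - \exp(-f^2 M))^{k^2/2}$. The sketch below does this; the function names and the values $m = 1000$, $k = 10$, $p_{01\epsilon} = 0.01$ are illustrative assumptions.

```python
import math

def pattern_capacity(m, k, p01_eps):
    """Breakdown load M_eps = -(m/k)^2 * ln(1 - p01_eps^(2/k^2))."""
    return -(m / k) ** 2 * math.log(1.0 - p01_eps ** (2.0 / k ** 2))

def commission_error(m, k, M):
    """Commission error estimate p1^(k^2/2) with p1 = 1 - exp(-f^2 M)."""
    f = k / m
    p1 = 1.0 - math.exp(-f * f * M)
    return p1 ** (k * k / 2.0)

def pattern_capacity_large_k(m, k, p01_eps):
    """Large-k expansion (m/k)^2 * (2 ln k - ln(-2 ln p01_eps))."""
    return (m / k) ** 2 * (2.0 * math.log(k) - math.log(-2.0 * math.log(p01_eps)))

m, k, p01_eps = 1000, 10, 0.01            # illustrative values only
M_eps = pattern_capacity(m, k, p01_eps)
round_trip = commission_error(m, k, M_eps)
expansion = pattern_capacity_large_k(m, k, p01_eps)
```

Storing exactly `M_eps` patterns returns the prescribed error level, and already at $k = 10$ the large-$k$ expansion lies within a few percent of the exact value; the leading term $2 m^2 \ln k / k^2$ alone is only reached for much larger $k$.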

Notice how the maximal pattern load is a function of $k$ and $m$. This result is in contrast with the real-valued network employing the covariance rule, where the familiarity discrimination capacity is essentially independent of the pattern activity level (Bogacz and Brown, 2003). Just as in previous analyses of the Willshaw model (Willshaw et al, 1969; Palm, 1980; Nadal and Toulouse, 1990; Knoblauch et al, 2010), however, we find a dependence of $M_\epsilon$ on $k$. With the binary synapses induced by Willshaw learning, it is clear that $M_\epsilon$ is maximised in the sparse coding regime $f \ll 1$; the actual optimal activity level parameter $k_{opt}$ is just a function of $p_{01\epsilon}$ and can easily be found numerically. To gain additional insight on the typical size of $k_{opt}$, let us obtain an approximation for the pattern capacity,

$$M_\epsilon \approx \frac{m^2}{k^2}\left(2 \ln k - \ln(-2 \ln p_{01\epsilon})\right), \tag{19}$$

which is maximal when

$$k = \exp\left(\tfrac{1}{2}\left(1 + \ln(-2 \ln p_{01\epsilon})\right)\right) \approx k_{opt}. \tag{20}$$

Recalculating $M_\epsilon$ with $k = k_{opt}$, we find that

$$\max_k M_\epsilon \approx -\frac{m^2}{2 e \ln p_{01\epsilon}} \tag{21}$$
$$\approx 0.18\,(-\ln p_{01\epsilon})^{-1} m^2. \tag{22}$$



Just to illustrate the result above, if one sets the desired error rate at $p_{01\epsilon} = 0.01$, the obtained breakdown quantity of patterns per synapse becomes about $M_\epsilon / m^2 \approx 0.04$.
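This illustration is easy to reproduce. The sketch below assumes the large-$k$ form of the pattern capacity, $M_\epsilon \approx (m/k)^2 (2 \ln k - \ln(-2 \ln p_{01\epsilon}))$, whose maximiser in $k$ is $\exp(\tfrac{1}{2}(1 + \ln(-2 \ln p_{01\epsilon})))$; the function names are hypothetical.

```python
import math

def k_opt(p01_eps):
    """Maximiser of (2 ln k - c) / k^2 with c = ln(-2 ln p01_eps)."""
    return math.exp(0.5 * (1.0 + math.log(-2.0 * math.log(p01_eps))))

def max_load_per_synapse(p01_eps):
    """Maximal M_eps / m^2, i.e. -1 / (2 e ln p01_eps)."""
    return -1.0 / (2.0 * math.e * math.log(p01_eps))

p01_eps = 0.01
k_star = k_opt(p01_eps)                  # close to 5 active units per pattern
load = max_load_per_synapse(p01_eps)     # close to 0.04 patterns per synapse
```

The optimum therefore sits at a handful of active units regardless of network size, which is why this parametrisation forces extremely sparse patterns.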

Although 'greedily' maximising $M_\epsilon$ leads to an extensive quantity of patterns per synapse, this approach also imposes a heavy coding restriction in the form of quite small values for $k$ and an optimising expression that does not vary with $m$, a parametrisation that is referred to by Knoblauch et al (2010) as the ultra-sparse coding regime. In the next sections we proceed to richer performance measures where the required underlying resources and the Shannon information of the task are also taken into account.



2.2 Classical network capacity 



The commission error probability $p_{01}$ can as well be used to calculate the traditional network capacity measure $C$ in bits per synaptic contact. Here there is a fundamental difference between the associative and familiarity tasks, as observed by Barrett and van Rossum (2008) and Greve et al (2009): the familiarity network can 'transmit' at most one bit per learned pattern (the perfect output of $D(x)$), instead of order $k$ bits per pattern as in the associative case (Palm, 1980; Knoblauch et al, 2010). The optimal local, linear, additive covariance rule (that induces real-valued synaptic weights) can then only obtain 0.057 bits per synapse in the $M \to \infty$ errorful regime (Greve et al, 2009), which is rather low when compared to the 0.72 bits per synapse that the same rule can achieve in the high fidelity pattern association task (Palm and Sommer, 1996).



The analogy at hand is to interpret the familiarity network as a discrete binary channel which transmits novel and familiar patterns with a certain error probability, and then calculate the information-theoretic channel capacity, which is the maximal mutual information (Shannon, 1948; Cover and Thomas, 2006) normalised by the number of required synaptic contacts,

$$C = \frac{I(X^1, \ldots, X^\omega, \ldots, X^\Omega;\ Y^1, \ldots, Y^\omega, \ldots, Y^\Omega)}{m^2}. \tag{23}$$

Here $X^\omega \in \{0,1\}$ is a binary random variable indicating whether the $\omega$-th presented pattern is familiar ($X^\omega = 1$) or novel ($X^\omega = 0$), and $Y^\omega = D(x^\omega) \in \{0,1\}$ is the network output for the $\omega$-th pattern. As in previous work (Barrett and van Rossum, 2008; Greve et al, 2009), we assume that $\Omega = 2M$ patterns are presented and an equal prior probability of a pattern being familiar or novel, $P(X^\omega = 0) = P(X^\omega = 1) = 1/2$. Besides allowing for a direct fair comparison with the previously obtained results, a prior model with equiprobable pattern classes maximises the channel capacity when the conditional error probabilities are equal, $p_{10} = p_{01}$. In our case, assuming the network is parametrised for high fidelity, this choice is approximately optimal, as we have $p_{10} = 0$ and $p_{01} \approx 0$.

Since we are 'transmitting' $M$ learned and $M$ novel patterns independently generated according to the statistics of section 2.1, the process can be decomposed into $2M$ transmissions of a single (say, the $\omega$-th) pattern,

$$C = \frac{2M}{m^2}\, I(X^\omega; Y^\omega) \tag{24}$$
$$= \frac{2M}{m^2} \left[ 1 - \tfrac{1}{2}\left( (1 + p_{01})\,\mathrm{ld}(1 + p_{01}) - p_{01}\,\mathrm{ld}\,p_{01} \right) \right], \tag{25}$$

where $p_{01}$ is the commission error probability, defined in (15) as a function of the task parameters $m, k, M$. The derivation of the single-pattern mutual information is given in the appendix; a similar calculation has been carried out in the single-neuron information maximisation framework of Barrett and van Rossum (2008), in a comparison of the Willshaw rule with more elaborate stochastic synaptic learning.
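The single-pattern mutual information can be cross-checked against a direct entropy calculation. The sketch below assumes a binary channel with no omission errors ($p_{10} = 0$), commission error $p_{01}$ and equiprobable classes, for which $I(X;Y) = h\left(\tfrac{1 + p_{01}}{2}\right) - \tfrac{1}{2} h(p_{01})$ with $h$ the binary entropy; the closed form used in `single_pattern_information` follows from expanding that expression, and all function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def single_pattern_information(p01):
    """Closed form 1 - ((1 + p) ld(1 + p) - p ld p) / 2 for p10 = 0 and equal priors."""
    if p01 == 0.0:
        return 1.0
    return 1.0 - 0.5 * ((1.0 + p01) * math.log2(1.0 + p01) - p01 * math.log2(p01))

def information_from_entropies(p01):
    """Same quantity computed as H(Y) - H(Y|X)."""
    return binary_entropy((1.0 + p01) / 2.0) - 0.5 * binary_entropy(p01)

def network_capacity(m, M, p01):
    """Bits per synaptic contact: 2M / m^2 times the single-pattern information."""
    return 2.0 * M / m ** 2 * single_pattern_information(p01)

info = single_pattern_information(0.01)   # slightly below one bit
```

At a one percent commission error the channel still carries about 0.96 bits per presented pattern, so replacing the bracketed term by one changes the capacity only marginally.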

Unfortunately, unlike the network capacity achieved in the associative case, in our task $C$ is largest for finite small $m$ (see figure 1), but vanishes when $m \to \infty$, for any activity level function $k$ that increases with $m$.

To show this, let us take an arbitrary, finite probability $p_{01\epsilon}$ close to zero, to keep the discrimination error from growing large; in this case, the bracketed quantity in (25) becomes approximately one. Then, the capacity becomes

$$C \approx -2 k^{-2} \ln\left(1 - p_{01\epsilon}^{2/k^2}\right). \tag{26}$$

In the limit $k, m \to \infty$, we can take $M_\epsilon$ from equation (18); the capacity $C$ no longer depends on the error bound $p_{01\epsilon}$ and is given by

$$C \approx \frac{4 \ln k}{k^2}. \tag{27}$$

Figure 1: Network capacity $C$ in bits per synaptic contact vs. network size $m$ (in logarithmic scale) for a variety of activity level orders, with pattern load $M_\epsilon$ given by (17) at conditional error rate $p_{01\epsilon} = 0.01$. For $k$ of order $\log m$ the capacity is stable, yet slowly decreasing towards zero as predicted by the asymptotic analysis. Less sparse patterns (e.g., when $k = \sqrt{m}$) lead to low capacity even for small $m$. When the activity level increases to $k = m^{2/3}$ the network capacity becomes near-zero for any network size. For $p_{01\epsilon} = 0.01$, integer-constrained numerical optimisation of $C$ with respect to $k$ while $M$ is accordingly set at $M_\epsilon$ reveals that the maximum $C \approx 0.11$ is achieved when $k = 4$, a result which is in agreement with the previous findings of Barrett and van Rossum (2008).



We have reached a result which describes a qualitative behaviour that is rather different from the one found in the typical long-term associative memory task, where capacity is clearly a function of network size, and an increasing one when the activity level $k$ is of correct order (Willshaw et al, 1969; Palm, 1980; Dayan and Willshaw, 1991). For a given fixed error probability $p_{01\epsilon}$, the capacity $C$ of the Willshaw network for discrimination is not directly a function of network size $m$. In our case, for any order of $k$ as an increasing function of $m$, in the limit of $m \to \infty$, the capacity of the system collapses, even if the limit is reached slowly. One can avoid near-zero capacity for large networks only in the ultra-sparse regime, where $k$ is kept small and constant (e.g., $k = 4$) and the capacity remains non-zero (and independent of $m$).



2.3 Synaptic capacity 

Let us consider now the synaptic capacity measure $C^S$ (in bits per active synapse) recently suggested by Knoblauch et al (2010). Here, only functional synapses (i.e., non-zero synaptic connections which play a role in the network task) are considered to count; silent synapses are either assumed to be wired but metabolically cheap to maintain, or the network is assumed to be endowed with structural plasticity and able to prune irrelevant synapses and rewire new connections as needed (e.g., Poirazi and Mel, 2001; Chklovskii et al, 2004; Holtmaat and Svoboda, 2009). In the simple pattern statistics we consider, we obtain $C^S$ by renormalising the network capacity $C$ (as given by equation 25) by a factor $F$ denoting the fraction of functional synapses:

$$C^S = \frac{C}{F} = \frac{2M}{F m^2}\, I(X^\omega; Y^\omega). \tag{28}$$



In the classical Willshaw model, the functional elements correspond to the 1-synapses, the expected fraction of which is $p_1$ (our $F$, then) as defined in equation (8). However, at the maximal pattern load $M_\epsilon$, even when the discrimination error bound $p_{01\epsilon}$ is kept low, most synapses are in the potentiated state. We can see this by rewriting $p_1$ as a function of $p_{01\epsilon}$: when $M$ is given by $M_\epsilon$, combining equations (9) and (16), we obtain

$$p_1 \approx p_{01\epsilon}^{2/k^2} > 1/2, \tag{29}$$

which approaches unity as we let $k \to \infty$ and is already larger than 1/2, even for small $p_{01\epsilon}$ close to zero and low activity $k$. Once again, in the limit $m \to \infty$, when $k$ is allowed to vary as a function of $m$, we have $F m^2 \to m^2$, which implies a capacity collapse $C^S \to C \to 0$. The differences between $C^S$ and $C$ for finite $m$ are also rather small, as illustrated by figure 2.




Figure 2: The ratio $F = C / C^S = p_1$ between network and synaptic capacities for the Willshaw model, when the error probability bound is $p_{01\epsilon} = 0.01$, shown for different activity functions $k(m)$. Since the maximal network capacity for each pair $(m, k(m))$ is achieved at a higher connectivity level $p_1$ as the coding rate increases, the relative advantage of considering only functional synapses becomes negligible.



However, parametrisations leading to the so-called dense potentiation regime $p_1 \to 1$ (as $m \to \infty$) can be quite advantageous in terms of synaptic capacity when the connectivity matrix $W$ is set according to the inhibitory Willshaw learning rule. In the associative task, this rule is able to achieve a synaptic capacity already an order of magnitude larger than that of the original excitatory model for reasonable pattern activity $k$ and plausible network size, and arbitrarily higher values in large networks with appropriate activity levels (Knoblauch et al, 2010). Furthermore, it is one of the limit cases of the optimal non-linear Bayesian local synaptic update (Knoblauch, 2011).

The inhibitory rule is a subtle variation of equation (2), as the synaptic states set by the original rule are simply switched: each 0-synapse (encoding non-coincidental activity) becomes functional as an inhibitory synapse $\tilde{W}_{ij} = -1$; conversely, each 1-synapse becomes silent, $\tilde{W}_{ij} = 0$. We denote the synaptic connectivity matrix of the inhibitory variant by $\tilde{W}$; after $M$ pattern presentations, the state of synapse $i \to j$ is

$$\tilde{W}_{ij} = W_{ij} - 1 = \min\left(1, \sum_{\mu=1}^{M} x_i^\mu x_j^\mu\right) - 1 \in \{-1, 0\}, \tag{30}$$

where $W_{ij}$ is the 0-1 weight that would be induced by the excitatory rule.

The energy for a familiar cue $x \in S$ is now $\theta_I = \tilde{H}(x) = 0$, following the reasoning which led to the derivation of $\theta_W$. Novel patterns should activate the inhibitory synapses so that for a given $x \notin S$, $\tilde{H}(x) > 0 = \theta_I$; thus, the discrimination function (4) remains unchanged.

Notice that the excitatory and inhibitory networks are functionally equivalent and that the (classical) network capacities of both implementations are equal, i.e., $\tilde{C} = C$. The fundamental quantity to observe is the synaptic capacity $\tilde{C}^S$ of the inhibitory network, as it is inversely proportional to the fraction $F$ of inhibitory synapses,

$$\tilde{C} / \tilde{C}^S = P(\tilde{W}_{ij} = -1) = 1 - p_1 = (1 - f^2)^M \tag{31}$$
$$\approx \exp(-f^2 M) = F, \tag{32}$$

where we have used approximation (9) for $p_1$.
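The switch between the two interpretations can be illustrated on a toy network. In the sketch below the inhibitory matrix is obtained by subtracting one from every excitatory weight, so that unpotentiated contacts become $-1$ and potentiated ones become silent; the stored patterns, the probe and the helper names are hypothetical and serve only to show that both networks separate the same cues while relying on different sets of functional synapses.

```python
def clipped_weights(patterns, m):
    """Excitatory Willshaw matrix with entries in {0, 1}."""
    W = [[0] * m for _ in range(m)]
    for x in patterns:
        for i in range(m):
            for j in range(m):
                if x[i] and x[j]:
                    W[i][j] = 1
    return W

def inhibitory_weights(W):
    """Switched states: 0-synapses become -1 (functional), 1-synapses become 0 (silent)."""
    return [[w - 1 for w in row] for row in W]

def energy(W, x):
    """H(x) = -sum_ij W_ij x_i x_j, valid for either weight matrix."""
    n = len(x)
    return -sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def functional_synapses(W):
    """Number of non-zero entries, i.e. synapses that must physically exist."""
    return sum(1 for row in W for w in row if w != 0)

stored = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0]]
W = clipped_weights(stored, 6)
W_inh = inhibitory_weights(W)
novel = [1, 0, 1, 0, 0, 0]
```

Stored patterns have zero energy in the inhibitory network while the probe has positive energy, so a threshold at zero reproduces the excitatory decision. In this tiny example the inhibitory matrix actually needs more functional synapses (28 against 8); the advantage only appears at high loads, when most excitatory synapses are potentiated and few inhibitory ones remain.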

Alternatively, $F$ can be obtained as a function of the error probability bound $p_{01\epsilon}$ from (29),

$$F = 1 - p_1 \approx 1 - p_{01\epsilon}^{2/k^2}. \tag{33}$$

Expanding the network capacity $C$ as in (26) and inserting in (28) the factor $F$ we have just derived, we arrive at the synaptic capacity of the inhibitory network as a function of $k$ and $p_{01\epsilon}$:

$$\tilde{C}^S \approx -\frac{2 \ln\left(1 - p_{01\epsilon}^{2/k^2}\right)}{k^2 \left(1 - p_{01\epsilon}^{2/k^2}\right)}, \tag{34}$$

which is approximately

$$\tilde{C}^S \approx -\frac{2 \ln k - \ln(-2 \ln p_{01\epsilon})}{\ln p_{01\epsilon}}, \tag{35}$$

the approximation improving as $k$ increases.

Asymptotically, letting $k \to \infty$, the capacity further simplifies to

$$\tilde{C}^S \approx -\frac{2 \ln k}{\ln p_{01\epsilon}}. \tag{36}$$

Figure 3: Synaptic capacity $\tilde{C}^S$ (in bits per synapse) for the inhibitory Willshaw rule in the same conditions of figure 1, calculated through normalisation of $C$ (cf. equation 25) by $F = 1 - p_1$. In the moderately-sparse coding regime (supra-logarithmic $k(m)$), which would otherwise lead to quickly vanishing $C$ and $C^S$ in the excitatory Willshaw model, the inhibitory network is capable of storing more than one bit per functional synapse already at surprisingly small $m$. As discussed in the main text, the synaptic capacity increases with $m$, as long as $k$ is as well an increasing function of $m$.



Notice that for large $k$, the $k^2$ factor that was hampering the capacity in the excitatory model has disappeared, both in the finite case (35) and in the large network limit (36).

What is remarkable is that as $m \to \infty$, the synaptic capacity $\tilde{C}^S$ diverges for any $k$ that increases with $m$, assuming that the binomial approximative theory we employ remains valid. For finite networks and activity levels of order $m^p$ with $0 < p < 1$, $\tilde{C}^S$ already surpasses unity for small- and medium-sized systems (see figure 3). Even for 'classical' sparseness where $k$ is of logarithmic size, the capacity increases with network size (recall that $C$ was always vanishing for any non-constant $k$) and is always well above zero.

To picture the difference in capacities, for a network of size $m = 10^6$, an error rate
of $p_{01\epsilon} = 0.01$ and a logarithmic activity level $k = \ln m \approx 14$, we obtain the network
capacity $C \approx 0.03$, while the synaptic capacity is $C^s \approx 0.70$. If the coding level rises to
a more realistic setting such as $k = \sqrt{m} = 1000$, the difference becomes drastic, as we
have $C \approx 2.4 \times 10^{-5}$ and $C^s \approx 2.6$.
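
As a rough numerical companion to this example, the following Python sketch evaluates the leading-order expressions (35) and (36) for the two activity levels considered. It assumes natural logarithms and the sign convention in which $-\ln p_{01\epsilon} > 0$ appears in the denominator; the function names are illustrative and not part of the original analysis.

```python
import math

def synaptic_capacity_35(k, p01):
    """Finite-k approximation (35): (2 ln k - ln(-2 ln p01)) / (-ln p01)."""
    return (2.0 * math.log(k) - math.log(-2.0 * math.log(p01))) / (-math.log(p01))

def synaptic_capacity_36(k, p01):
    """Large-k limit (36): 2 ln k / (-ln p01)."""
    return 2.0 * math.log(k) / (-math.log(p01))

p01 = 0.01  # commission error bound used in the text
for k in (14, 1000):
    c35 = synaptic_capacity_35(k, p01)
    c36 = synaptic_capacity_36(k, p01)
    print(f"k = {k:5d}: (35) gives {c35:.3f}, (36) gives {c36:.3f}")
```

With $p_{01\epsilon} = 0.01$ the finite-$k$ form gives roughly 0.66 for $k = 14$ and roughly 2.52 for $k = 1000$. These are of the same order as the values quoted above but not identical to them, presumably because the quoted figures come from the full expressions rather than from these approximations. The gap between (35) and (36) narrows as $k$ grows, in line with the remark that the approximation improves with $k$.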

There is a major qualitative change when the excitatory rule is replaced by the inhibitory
one. Since $F \to 0$ as $k \to \infty$, in the limit of large networks the system is
characterised by few synapses carrying a great amount of information. For moderate
sparseness where $k$ is of the form $m^p$, $0 < p < 1$, and any setting of $p$, the synaptic
capacity is (asymptotically)

$$C^s \approx 2p \, (-\ln p_{01\epsilon})^{-1} \ln m, \tag{37}$$

which grows with $m$ as fast as the corresponding asymptotic bound for the associative
case (see Table 1, Knoblauch et al., 2010), although here the high fidelity requirement
enforced through the constant $p_{01\epsilon} > 0$ affects the obtained capacity more strongly. Note
that the maximal pattern load is still large; substituting $m^p$ for $k$ in equation 17 we find

$$M_\epsilon \approx -m^{2-2p} \ln\left(1 - p_{01\epsilon}^{2/m^{2p}}\right), \tag{38}$$

which becomes, in the limit of large networks $m \to \infty$,

$$M_\epsilon \approx 2p \cdot m^{2-2p} \cdot \ln m. \tag{39}$$

When $k$ is of order $\sqrt{m}$, asymptotically we obtain the pattern capacity $M_\epsilon = m \ln m$,
which is still supralinear in $m$, while the fraction of required functional synapses $F$ tends
to zero.
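
To see how quickly the finite-size pattern load (38) approaches its asymptote (39), the two expressions can be compared numerically. The sketch below is only an illustration under the assumption that both equations use natural logarithms; the function names are hypothetical.

```python
import math

def pattern_load_38(m, p, p01):
    """Equation (38): -m^(2-2p) * ln(1 - p01^(2 / m^(2p)))."""
    k_sq = m ** (2.0 * p)                                    # k^2 with k = m^p
    one_minus_p1 = -math.expm1(2.0 * math.log(p01) / k_sq)   # 1 - p01^(2/k^2), stable for large k
    return -(m * m / k_sq) * math.log(one_minus_p1)

def pattern_load_39(m, p):
    """Equation (39): 2p * m^(2-2p) * ln m."""
    return 2.0 * p * m ** (2.0 - 2.0 * p) * math.log(m)

p, p01 = 0.5, 0.01  # k of order sqrt(m), error bound from the text
for m in (1e4, 1e6, 1e12):
    ratio = pattern_load_38(m, p, p01) / pattern_load_39(m, p)
    print(f"m = {m:.0e}: (38)/(39) = {ratio:.3f}")
```

For $p = 1/2$ the ratio stays below one and creeps towards it only logarithmically in $m$, because the term $\ln(-2 \ln p_{01\epsilon})$ neglected in (39) is constant while $\ln m$ grows slowly.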

In summary, considering that only functional synapses are relevant for the capacity
measure, the Willshaw-type inhibitory learning rule leads to efficient familiarity
discrimination in the limit of synaptic precision (two-state synapses). Interestingly, as in the
pattern association task (Knoblauch et al., 2010), the network achieves high storage
capacities for coding rates of the form $f = k/m = m^{p-1} = m^{-a}$, $0 < a < 1$, which for most
cortical regions are (arguably) more realistic than the logarithmic levels required by the
excitatory rule. If one accepts the logarithmic coding requirement, then the inhibitory
model offers a pattern load that grows as $2 m^2 \ln\ln m \, (\ln m)^{-2}$ (see equation 18), still
achieving capacities around one bit per synapse while maintaining high fidelity in the
discriminator output and low anatomical connectivity.



2.4 Corrections for binomially-distributed activity levels 

To reach the previous results we have assumed that the activity level per pattern was fixed
at exactly $k$ firing neurons at any given time, i.e., $|\tilde{x}| = |x^\mu| = k$ was kept constant across
all $\mu$. Thus, all patterns were permutations of each other chosen from the $\binom{m}{k}$ possible
configurations, as in the analysis of Palm (1980). However, from the biological modelling
perspective it might be more reasonable to take the assembly size as a random variable.
In this section we let $|x^\mu|$ and $|\tilde{x}|$ assume a binomial distribution with characteristic
probability $f = k/m$, so that the mean activity level is still $k$, but the activity levels
are allowed to vary.

In this case, the treatment is harder since we have to replace the constant parameter
$k$ in the capacity analyses by a random variable. We use a star superscript '*'
whenever appropriate to mark quantities for which $|\tilde{x}|$ and $|x^\mu|$ are random variables.

First, since the patterns have varying activity levels, to recover the 'no-omission-errors'
property $p_{10} = 0$, we adjust the discrimination threshold for the excitatory network
accordingly on a cue-by-cue basis,



$$\theta^* = -\frac{|\tilde{x}|^2}{2} = -\frac{Z^2}{2}, \tag{40}$$



denoting the binomially-distributed pattern activity level by the random variable $Z$.
Alternatively, the variable threshold could be implemented by introducing an external
feedforward inhibition field in the energy read-out, corresponding to a translation in the
energy function,



$$H^*(\tilde{x}) = H(\tilde{x}) - \theta^*, \tag{41}$$

implying $H^*(x) = 0$ for familiar $x \in S$, as in the inhibitory Willshaw network
implementation.

When the weights are set according to the inhibitory rule (30), there is no need for the
explicit external field, as the energy reads immediately $H(\tilde{x}) = H^*(\tilde{x})$ and the threshold
can simply be fixed at $\theta^*_I = \theta_I = 0$ as before. For the excitatory network, however,
the variable threshold control is fundamental to stabilise the energy, as can be seen for
instance through inspection of the variances of non-translated vs. translated energies
(not shown here).
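
The equivalence between the translated excitatory energy and the inhibitory energy can be checked on a toy network. The sketch below is a minimal illustration under stated assumptions: the energy is taken to be $H(\tilde{x}) = -\frac{1}{2}\sum_{ij} W_{ij}\tilde{x}_i\tilde{x}_j$ with self-connections included, the cue-dependent threshold is $\theta^* = -|\tilde{x}|^2/2$, and the inhibitory weights are obtained as $W - 1$. None of these conventions is restated in this section, so they should be read as assumptions, as should all function names and the toy parameter values.

```python
import random

def random_pattern(m, f, rng):
    """Binary pattern whose activity level is binomially distributed (rate f)."""
    return [1 if rng.random() < f else 0 for _ in range(m)]

def willshaw_store(patterns, m):
    """Excitatory Willshaw rule: W[i][j] = 1 once units i and j have been co-active."""
    W = [[0] * m for _ in range(m)]
    for x in patterns:
        active = [i for i, xi in enumerate(x) if xi]
        for i in active:
            for j in active:
                W[i][j] = 1
    return W

def energy(W, x):
    """Assumed energy H(x) = -1/2 * sum_ij W_ij x_i x_j."""
    active = [i for i, xi in enumerate(x) if xi]
    return -0.5 * sum(W[i][j] for i in active for j in active)

def translated_energy(W, x):
    """H*(x) = H(x) - theta*, with the assumed threshold theta* = -|x|^2 / 2."""
    theta_star = -0.5 * sum(x) ** 2
    return energy(W, x) - theta_star

rng = random.Random(0)
m, f, M = 60, 0.1, 20          # toy sizes, chosen only for illustration
stored = [random_pattern(m, f, rng) for _ in range(M)]
W = willshaw_store(stored, m)
W_inh = [[w - 1 for w in row] for row in W]  # assumed inhibitory weights: 1 -> 0, 0 -> -1

novel = random_pattern(m, f, rng)
print("familiar:", translated_energy(W, stored[0]), energy(W_inh, stored[0]))
print("novel:   ", translated_energy(W, novel), energy(W_inh, novel))
```

Under these assumptions the translated excitatory energy and the inhibitory energy coincide for every cue, are zero for stored patterns, and are non-negative for novel ones, so a fixed threshold at zero suffices in the inhibitory implementation.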

In the following, $p_B(x; n, p) = \binom{n}{x} p^x (1-p)^{n-x}$ is the probability mass function of the
binomial distribution. We first approximate the conditional error probability by



(42) 



M-l 



J2PB(i;M-l,f): 



i=0 
m 

x5> fl (*;m,/)(l-(l /r 

z=l 



(43) 



which is the expression found by Buckingham and Willshaw (1992) for the associative
task under the same statistical assumptions, now adjusted to the quadratic familiarity
discriminator; the full analysis of the distribution is due to Knoblauch (2008). Notice
that equation 43 is just an approximation, as the analyses of the associative case assume
independence among the columns of $W$. To compute the exact conditional error
probability of the quadratic discriminator, however, would require analysing a $k \times k$ sub-matrix
of $W$, which is a difficult combinatorial problem we do not solve.



Approximating the exponent and employing the binomial approximation, as in (15)
we obtain

$$p^*_{01} \approx \sum_{z} p_B(z; m, k/m) \, p_1^{z^2/2} \geq p_1^{k^2/2}, \tag{44}$$



$p_1$ being the expected matrix load as given by (8). Notice that in general, as expected
and as in the case of the covariance rule (Bogacz and Brown, 2002; Greve et al., 2009),
the error probability is never smaller than when the activity level is kept constant.

It is hard to obtain the pattern load $M^*$ as a function of $p^*_{01}$ without writing the
summation in (44) in closed form, which is difficult to accomplish due to the quadratic
exponent. However, we can find numerically the $M^*$ such that the commission error
probability is approximately equal to some arbitrary bound close to zero (say, $p_{01\epsilon} = 0.01$),
from which we compute the corresponding synaptic capacity $C^{s*}$. Then, to assess
the impact of letting $k$ vary, we can see how the ratio $\gamma = C^{s*}/C^s$ evolves as $m$ grows,
for different mean activity levels.
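
The numerical search described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the figures: it assumes the expected matrix load $p_1 = 1 - (1 - k^2/m^2)^M$, evaluates the sum in (44) over $z = 1, \dots, m$, treats $M$ as a continuous variable, and stops at the pattern load (computing $\gamma$ additionally requires the capacity definitions of the main text). All function names are hypothetical.

```python
import math

def log_binom_pmf(z, n, p):
    """Logarithm of the binomial probability mass function p_B(z; n, p)."""
    return (math.lgamma(n + 1) - math.lgamma(z + 1) - math.lgamma(n - z + 1)
            + z * math.log(p) + (n - z) * math.log1p(-p))

def matrix_load(M, m, k):
    """Assumed expected matrix load p_1 = 1 - (1 - k^2/m^2)^M."""
    return -math.expm1(M * math.log1p(-(k / m) ** 2))

def error_binomial(M, m, k):
    """Right-hand sum of (44): sum_z p_B(z; m, k/m) * p_1^(z^2/2), z = 1..m (assumed range)."""
    log_p1 = math.log(matrix_load(M, m, k))
    return sum(math.exp(log_binom_pmf(z, m, k / m) + 0.5 * z * z * log_p1)
               for z in range(1, m + 1))

def error_fixed(M, m, k):
    """Fixed-activity error p_1^(k^2/2), the lower bound in (44)."""
    return matrix_load(M, m, k) ** (0.5 * k * k)

def bisect_load(error_fn, m, k, target=0.01, iters=100):
    """Largest (real-valued) M with error_fn(M) < target; error_fn must increase with M."""
    lo, hi = 1.0, 2.0
    if error_fn(lo, m, k) >= target:
        raise ValueError("target error already exceeded at M = 1")
    while error_fn(hi, m, k) < target:   # grow the bracket until it contains the target
        lo, hi = hi, 2.0 * hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if error_fn(mid, m, k) < target:
            lo = mid
        else:
            hi = mid
    return lo

m, k = 1000, 10  # small illustrative network, not a setting from the text
M_fixed = bisect_load(error_fixed, m, k)
M_star = bisect_load(error_binomial, m, k)
print(f"fixed activity:    M = {M_fixed:.0f}")
print(f"binomial activity: M* = {M_star:.0f} (ratio {M_star / M_fixed:.2f})")
```

Because every term of the sum grows with $p_1$, and $p_1$ grows with $M$, the error is monotone in $M$ and bisection is sufficient. For this toy setting the binomial load comes out below the fixed-activity load, consistent with the inequality in (44).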

As plotted in figure 4, $\gamma$ approaches unity as the network size parameter $m$ increases,
and quickly so when the patterns are moderately sparse ($k = m^p$). For small, finite $m$
there is a rather large factor affecting $M^*$ that originates in the disorder introduced by the
variability in the activity levels. This factor can be (approximately) as large as 1/5 for $k$
of logarithmic size but attenuates as $m$ grows. Our numerical analysis thus strongly suggests
that the system remains qualitatively intact and that the former conclusions drawn for
fixed $k$ should hold, even for finite networks, although the discriminator is subject to a
correcting factor which decreases the capacity of the model.

Figure 4: The ratio $\gamma = C^{s*}/C^s$ between the obtained synaptic capacities (calculated
through normalisation by $F$ of the network capacity of equation 25) in the binomial- and
fixed-activity pattern generation scenarios. Connecting (interpolating) lines are visual
aids; solid markers represent the ratio of capacities computed for actual measured $M^*$
(binomially-distributed $Z$) vs. theoretical maximal $M_\epsilon$ (fixed $z = k$) as given by (17).
The pattern load $M^*$ was found numerically by bisection search over equation 44 with
the target $p^*_{01}$ set at $p_{01\epsilon} = 0.01$. The relative difference between $C^{s*}$ and $C^s$ fades as $m$
grows and when the expected activity level order $k(m)$ increases.



3 Discussion 

If one restricts the model to operate with two-state synapses, a well-known and simple
local update scheme can offer a surprising familiarity discrimination capacity, provided that
the firing rates are kept low. We have analysed both the original Willshaw rule (Willshaw
et al., 1969) and a variation for inhibitory synapses recently proposed by Knoblauch et al.
(2010).



At high pattern loads, the traditional excitatory implementation imposes high
connectivity and a heavy coding restriction; we have seen that for large enough networks
the network capacity eventually approaches zero unless the activity levels are kept
constant (independent of network size) and very low at all times. For neural populations
of moderate size and low activity levels (e.g., of logarithmic order), one can obtain in
the high-fidelity regime information and pattern capacities that are comparable to those
found for the optimal linear rule. In this case, we find a rather low overall stored
information content per synapse in comparison to the typical values achieved in the associative
memory task, a fact that has already been discussed by Barrett and van Rossum (2008)
and Greve et al. (2009).



Taking into consideration that in the long-term the brain might prune silent synapses 
(that play a non-functional role and are mere spatial candidates for future potentiation) 
in stable memories and then place synapses in new locations as needed, Knoblauch et al.
(2010) suggested the so-called synaptic capacity measure where only functional resources
are taken into account. The critical observation we reach in our work is that the familiarity 
detection task parametrisation leads naturally to the dense potentiation regime, even for 
logarithmic sparse coding, which explains the large capacities achieved by the inhibitory 
Willshaw rule. In this case, we recover the increasing capacity function (with respect to 
network size) that is typical of the associative task. 

Of course, another question altogether is to locate such structures in the actual central 
nervous system, and to ascertain if the less conservative inhibitory rule (where connec- 
tions corresponding to previous coincidental activity are depressed and then pruned) is 
plausible and if it is actually observed in real synapses. It is worth noting that we have 
switched to an inhibitory circuit so that the energy 'readout' mechanism Q could remain 
intact, except for a change in the threshold. However, one could consider a sign-reversed 
connectivity matrix, i.e., an excitatory network implementation with exactly the same 
couplings as the inhibitory one. In this case, the less well-known inhibitory synaptic 
plasticity processes would be avoided, but the task would change, as a stronger excita- 
tory signal would be elicited in the presence of novel patterns. Such a model could be 
appropriate to describe a novelty detection mechanism in regions where stronger excita- 
tory activity is observed as a response to non-familiar stimuli. Our analysis should hold, 
as only the number (and not the type) of required functional synapses matters for the 
synaptic capacity measure we have considered. 

Following the previous studies of familiarity detection, our analysis has focused on 
simple high-level modelling assumptions that could be refined if the biological implications 
require so. For instance, one could consider incorporating well-known features of more 
realistic or detailed models, such as stochastic synaptic transmission, arbitrary query 
noise, or spiking neurons. 

A Derivation of the mutual information per pattern

For a given pattern transmission described by the true class (novel-familiar) of the pattern
$X^\mu$ and the network output $Y^\mu$, we can define the mutual information $I(X^\mu; Y^\mu)$ in
terms of the discriminator entropy $I(Y^\mu)$ and the conditional entropy $I(Y^\mu \mid X^\mu)$ of the
discrimination outcome given the correct classification,

$$I(X^\mu; Y^\mu) = I(Y^\mu) - I(Y^\mu \mid X^\mu). \tag{45}$$



Let us denote by $I(p) = -p \, \mathrm{ld} \, p - (1 - p) \, \mathrm{ld}(1 - p)$ the Shannon entropy in bits of a
binary random variable $X$ with $P(X = 1) = p$ and $P(X = 0) = 1 - p$. Then, we can
write the entropies in (45) with respect to the prior probability $p = P(X^\mu = 1)$ and the
error probabilities $p_{10}$ and $p_{01}$ (Cover and Thomas, 2006), leading to

$$I(Y^\mu) = I\bigl(p(1 - p_{10}) + (1 - p)\,p_{01}\bigr) = I\bigl(p + (1 - p)\,p_{01}\bigr) \tag{46}$$

and

$$I(Y^\mu \mid X^\mu) = p\,I(p_{10}) + (1 - p)\,I(p_{01}) = (1 - p)\,I(p_{01}), \tag{47}$$

recalling that $p_{10} = 0$ under the threshold setting of the main text.



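Inserting the expanded entropies into expression 45, and substituting $p = 1/2$ (the
probability of a pattern being familiar), we obtain

\begin{align}
I(X^\mu; Y^\mu) &= I\!\left(\frac{1 + p_{01}}{2}\right) - \frac{1}{2}\,I(p_{01}) \tag{48}\\
&= 1 - \frac{1}{2}\bigl((1 + p_{01})\,\mathrm{ld}(1 + p_{01}) + (1 - p_{01})\,\mathrm{ld}(1 - p_{01})\bigr) \tag{49}\\
&\quad - \frac{1}{2}\bigl(-p_{01}\,\mathrm{ld}\,p_{01} - (1 - p_{01})\,\mathrm{ld}(1 - p_{01})\bigr), \tag{50}
\end{align}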

which is the expression presented in the main text (equation 25). 
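
The derivation can be verified numerically. The sketch below computes the mutual information directly from the entropy difference in (45) to (47) and compares it with a further simplified form, $1 - \frac{1}{2}\bigl((1 + p_{01})\,\mathrm{ld}(1 + p_{01}) - p_{01}\,\mathrm{ld}\,p_{01}\bigr)$, obtained by cancelling the two $\mathrm{ld}(1 - p_{01})$ terms. That simplification and the function names are additions made here for illustration and do not appear in the original text.

```python
import math

def binary_entropy(p):
    """Shannon entropy I(p) in bits, with the convention 0 * ld 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def mutual_information(p01, p10=0.0, prior=0.5):
    """I(X; Y) = I(Y) - I(Y | X), following equations (45) to (47)."""
    p_y1 = prior * (1.0 - p10) + (1.0 - prior) * p01
    conditional = prior * binary_entropy(p10) + (1.0 - prior) * binary_entropy(p01)
    return binary_entropy(p_y1) - conditional

def mutual_information_closed(p01):
    """Simplified form for prior = 1/2 and p10 = 0 (an algebraic rearrangement)."""
    if p01 <= 0.0:
        return 1.0
    return 1.0 - 0.5 * ((1.0 + p01) * math.log2(1.0 + p01) - p01 * math.log2(p01))

for p01 in (0.001, 0.01, 0.1):
    print(p01, mutual_information(p01), mutual_information_closed(p01))
```

At the error bound $p_{01} = 0.01$ used throughout, both forms give about 0.96 bits per pattern, so each tested pattern carries slightly less than one bit.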